The analysis examines trends in Business Analytics, Data Science, and Machine Learning job postings, with a focus on the skills required for these roles. The study evaluates how varying skill combinations influence salary levels, remote work availability, and career progression pathways.
This analysis applies three machine learning approaches to job posting data: clustering to group roles by skill requirements, regression to examine how skills and experience influence salary, and classification to distinguish ML/Data Science positions from Business Analytics and other jobs. Using 25 technical skills along with experience and remote work indicators, the analysis shows that Business Analytics dominates the market (35% of roles), while ML and DS remain smaller but specialized segments. Results highlight that experience is the strongest salary driver, jobs fall into six clear clusters with different pay and remote work patterns, and BA, ML, and DS roles each display distinct skill signatures that make them easy to differentiate.
2 Data Loading and Setup
The analysis starts by loading the Lightcast job postings dataset and identifying relevant skill columns. The dataset contains comprehensive information about job postings including titles, salaries, required skills, and other job characteristics.
Code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import json
import re
from collections import Counter

pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"

# Load data from csv
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)
print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")
# print(df.head())
Dataset loaded: 72,498 rows, 131 columns
2.1 Important Skill Columns
The dataset contains multiple skill-related columns. After examining the schema, the columns SKILLS_NAME, SOFTWARE_SKILLS_NAME, and SPECIALIZED_SKILLS_NAME provide the most detailed skill information for this analysis. These columns list the specific technical skills mentioned in each job posting.
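As a quick sanity check, the snippet below tallies the most frequently mentioned skills. This is a minimal sketch, assuming each SKILLS_NAME cell stores the posting's skills as a single bracketed, comma-delimited string; the exact storage format may differ, in which case the split pattern needs adjusting.

Code

import re
from collections import Counter

# Sketch: count skill mentions across postings, assuming each cell is one
# bracketed, comma-delimited string (adjust the split if the format differs).
skill_counts = Counter()
for cell in df['SKILLS_NAME'].dropna():
    for skill in re.split(r',\s*', cell.strip('[]')):
        skill_counts[skill.strip(' "\'')] += 1

print("Most frequently mentioned skills:")
for skill, count in skill_counts.most_common(10):
    print(f"{skill}: {count:,}")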
3 Skills Data Preprocessing
The next step involves filtering the data to include only records with valid salary and title information. Then, binary features are created for 25 key technical skills covering ML, Data Science, and Business Analytics domains to enable machine learning analysis.
Code
# Apply filters
df_filtered = df.dropna(subset=['SALARY', 'TITLE']).copy()

# Convert salary to numeric and filter
df_filtered['SALARY'] = pd.to_numeric(df_filtered['SALARY'], errors='coerce')
df_filtered = df_filtered[df_filtered['SALARY'] > 0]
print(f"Records after filtering: {len(df_filtered):,}")

df_skills = df_filtered.copy()

# Focus on key Business Analytics/ML/Data Science skills. Key skills for
# BA/ML/DS roles identified manually.
key_skills = [
    'Python (Programming Language)', 'R (Programming Language)',
    'SQL (Programming Language)', 'Machine Learning', 'Data Science',
    'Data Analysis', 'Statistics', 'Artificial Intelligence', 'TensorFlow',
    'PyTorch (Machine Learning Library)', 'Pandas (Python Package)',
    'NumPy (Python Package)', 'Scikit-Learn (Python Package)', 'Big Data',
    'Apache Spark', 'Apache Hadoop', 'Amazon Web Services', 'Microsoft Azure',
    'Google Cloud Platform (Gcp)', 'Data Visualization',
    'Tableau (Business Intelligence Software)', 'Power BI',
    'Natural Language Processing (NLP)', 'Computer Vision', 'Deep Learning'
]
print(f"Using focused {len(key_skills)} BA/ML/DS technical skills for analysis")

# Create binary features for each key skill.
for skill in key_skills:
    # Clean skill name for column naming
    # Eg: R (Programming Language) --> has_r_programming_language
    skill_col_name = f'has_{skill.lower().replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")}'
    df_skills[skill_col_name] = (
        df_skills['SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False)
        | df_skills['SOFTWARE_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False)
        | df_skills['SPECIALIZED_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False)
    ).astype(int)
print("Binary skill features created")

# Create ML/DS role indicator using focused skills
core_ml_skills = ['has_machine_learning', 'has_artificial_intelligence', 'has_tensorflow',
                  'has_pytorch_machine_learning_library', 'has_deep_learning',
                  'has_natural_language_processing_nlp', 'has_computer_vision']
core_ds_skills = ['has_python_programming_language', 'has_r_programming_language', 'has_statistics',
                  'has_data_science', 'has_pandas_python_package', 'has_numpy_python_package',
                  'has_scikit_learn_python_package', 'has_big_data']
core_ba_skills = ['has_data_analysis', 'has_data_visualization', 'has_sql_programming_language',
                  'has_tableau_business_intelligence_software', 'has_power_bi']

# Role indicators
# ML roles are straightforward.
df_skills['is_ml_role'] = (df_skills[core_ml_skills].sum(axis=1) > 0).astype(int)

# R language is primarily associated with the Data Science field. So,
# if a job requires R or lists more than one data science skill,
# it is considered a DS role. (Parentheses matter here: without them,
# `== 1 | ...` binds as `== (1 | ...)` and silently drops the OR condition.)
df_skills['is_ds_role'] = (
    (df_skills['has_r_programming_language'] == 1)
    | (df_skills[core_ds_skills].sum(axis=1) > 1)
).astype(int)

# Business Analytics roles typically require SQL, visualization tools (Tableau, Power BI)
# and data analysis capabilities. If a job has two or more BA skills, consider it a BA role.
df_skills['is_ba_role'] = (df_skills[core_ba_skills].sum(axis=1) >= 2).astype(int)

# Remote work indicator
df_skills['is_remote'] = df_skills['REMOTE_TYPE'].fillna(0).astype(int)
df_skills['experience_years'] = df_skills['MIN_YEARS_EXPERIENCE'].fillna(0)

df_final = df_skills
print(f"Final dataset size: {len(df_final):,}")
print(f"ML roles identified: {df_final['is_ml_role'].sum():,}")
print(f"Data Science roles identified: {df_final['is_ds_role'].sum():,}")
print(f"Business Analytics roles identified: {df_final['is_ba_role'].sum():,}")
Records after filtering: 30,808
Using focused 25 BA/ML/DS technical skills for analysis
Binary skill features created
Final dataset size: 30,808
ML roles identified: 3,226
Data Science roles identified: 2,877
Business Analytics roles identified: 10,831
For each of the 25 key skills, a binary indicator variable is created (1 if the skill is mentioned, 0 otherwise). This transforms the text skill data into numerical features suitable for machine learning models.
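To make the transformation concrete, here is a small illustrative sketch on two invented postings (not rows from the actual dataset), using the same column-naming rule as the preprocessing cell above.

Code

# Illustrative only: two hypothetical postings showing how skill text
# becomes binary indicator columns.
toy = pd.DataFrame({'SKILLS_NAME': [
    'Python (Programming Language), SQL (Programming Language)',
    'Tableau (Business Intelligence Software)']})
for skill in ['Python (Programming Language)', 'SQL (Programming Language)',
              'Tableau (Business Intelligence Software)']:
    col = f'has_{skill.lower().replace(" ", "_").replace("(", "").replace(")", "")}'
    toy[col] = toy['SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False).astype(int)
print(toy.filter(like='has_'))
# Posting 0 -> Python=1, SQL=1, Tableau=0; posting 1 -> Tableau=1 only.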
3.1 Role Classification Logic
Three role categories are identified based on technical skills (restated as a compact sketch after this list):
ML roles: Require advanced ML/AI skills like TensorFlow, PyTorch, Deep Learning, NLP, Computer Vision
Data Science roles: Require R programming, or more than one core DS skill (Python, Statistics, Data Science, Pandas, NumPy, Scikit-learn, Big Data)
Business Analytics roles: Require two or more core BA skills: SQL, data analysis, data visualization, or BI tools (Tableau, Power BI)
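The same rules, restated as a compact sketch. This paraphrases the labeling code already run above and reuses the core_ml_skills, core_ds_skills, and core_ba_skills lists it defined.

Code

# Condensed restatement of the role-labeling rules applied earlier.
def label_roles(row):
    """Return (is_ml, is_ds, is_ba) flags for one posting's skill row."""
    is_ml = row[core_ml_skills].sum() > 0                    # any advanced ML/AI skill
    is_ds = (row['has_r_programming_language'] == 1
             or row[core_ds_skills].sum() > 1)               # R, or 2+ core DS skills
    is_ba = row[core_ba_skills].sum() >= 2                   # 2+ core BA skills
    return int(is_ml), int(is_ds), int(is_ba)

# Example: label the first posting in the filtered data
print(label_roles(df_skills.iloc[0]))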
The analysis examines how these specialized skills impact salary and career opportunities. Machine learning models are used to find patterns that can guide job seekers in choosing which skills to develop.
4 Feature Engineering for ML
Before building models, the dataset is prepared by selecting relevant columns. This includes the salary (target variable), skill indicators, remote work status, and experience years.
Code
# Just prepare the modeling dataset
modeling_cols = ['SALARY', 'is_ml_role', 'is_ds_role', 'is_ba_role', 'is_remote', 'experience_years'] + \
    [col for col in df_final.columns if col.startswith('has_')]
df_modeling = df_final[modeling_cols].copy()

print("Features for modeling:")
print(f"Dataset shape: {df_modeling.shape}")
print(f"Columns: {list(df_modeling.columns)}")
print(f"Missing values: {df_modeling.isnull().sum().sum()}")
The modeling dataset now contains binary skill features, experience, remote work indicator, and salary information. This structured format allows application of various machine learning techniques.
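A quick profile of the prepared frame helps verify the structure before modeling; the sketch below assumes df_modeling from the cell above.

Code

# Sketch: quick profile of the modeling frame.
print(df_modeling['SALARY'].describe().round(0))  # target distribution
skill_cols = [c for c in df_modeling.columns if c.startswith('has_')]
# Share of postings mentioning each skill, most common first
print(df_modeling[skill_cols].mean().sort_values(ascending=False).head(10))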
5 Unsupervised Learning:
5.1 KMeans Clustering Based on Skills
The first machine learning approach uses KMeans clustering to discover natural groupings in the job market. This unsupervised technique groups jobs with similar skill profiles together, without using salary information. The goal is to see if jobs naturally segment into distinct categories based on their requirements.
Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             f1_score, confusion_matrix, classification_report)

# Prepare features for clustering using skills and other features
skill_feature_cols = [col for col in df_modeling.columns if col.startswith('has_')]
print(f"Available skill features: {len(skill_feature_cols)}")

# Base clustering features
clustering_features = skill_feature_cols + ['experience_years', 'is_remote']

# Encode ONET and NAICS6.
le_onet = LabelEncoder()
df_modeling['onet_encoded'] = le_onet.fit_transform(df_final['ONET'].fillna('Unknown'))
clustering_features.append('onet_encoded')

le_naics = LabelEncoder()
df_modeling['naics_encoded'] = le_naics.fit_transform(df_final['NAICS6'].fillna('Unknown'))
clustering_features.append('naics_encoded')

# Prepare clustering data
X_cluster = df_modeling[clustering_features].fillna(0)

# Scale features
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

# KMeans clustering
kmeans = KMeans(n_clusters=6, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_cluster_scaled)
df_modeling['cluster'] = clusters

# print("Skills based clustering completed")
# print("Cluster centers:")
# for i, center in enumerate(kmeans.cluster_centers_):
#     print(f"Cluster {i}: {center}")
Available skill features: 25
The clustering model groups similar jobs together using skill patterns, experience requirements, and job characteristics. The algorithm assigns each job to one of 6 clusters. Now the characteristics of each cluster can be examined to understand what makes them distinct.
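The choice of six clusters is a judgment call. One way to sanity-check it is an elbow curve over candidate values of k; the sketch below is not part of the original pipeline but reuses the scaled clustering matrix.

Code

# Sketch: within-cluster inertia for a range of k values, to sanity-check k=6.
inertias = []
k_values = list(range(2, 11))
for k_try in k_values:
    km = KMeans(n_clusters=k_try, random_state=42, n_init=10)
    km.fit(X_cluster_scaled)
    inertias.append(km.inertia_)

fig = px.line(x=k_values, y=inertias, markers=True,
              title="Elbow Curve - Within-Cluster Inertia by k",
              labels={'x': 'Number of clusters (k)', 'y': 'Inertia'})
fig.show()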
Code
# Analyze clustering.
cluster_summary = df_modeling.groupby('cluster').agg({
    'SALARY': ['count', 'mean'],
    'is_ml_role': 'mean',
    'is_ds_role': 'mean',
    'is_ba_role': 'mean',
    'is_remote': 'mean',
    'experience_years': 'mean'
}).round(2)
cluster_summary.columns = ['count', 'avg_salary', 'ml_role_pct', 'ds_role_pct',
                           'ba_role_pct', 'remote_percentage', 'avg_experience']
cluster_summary = cluster_summary.reset_index()

# Compute combined BA/ML/DS percentage on-the-fly
# A job has BA/ML/DS if it has any of the three role types
cluster_summary['ml_ds_ba_combined_pct'] = cluster_summary.apply(
    lambda row: ((df_modeling[df_modeling['cluster'] == row['cluster']]
                  [['is_ml_role', 'is_ds_role', 'is_ba_role']].sum(axis=1) > 0).mean()),
    axis=1
).round(2)

print("Skills based Cluster Summary:")
print(cluster_summary)

# Visualize cluster characteristics.
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Cluster Size', 'Average Salary', 'BA/ML/DS Role %',
                    'Remote Work %', 'Avg Experience', 'Salary Distribution'),
    specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "bar"}, {"type": "scatter"}]]
)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['count'], name="Count"), row=1, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_salary'], name="Avg Salary"), row=1, col=2)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ml_role_pct'], name="ML %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ds_role_pct'], name="DS %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ba_role_pct'], name="BA %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['remote_percentage'], name="Remote %"), row=2, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_experience'], name="Experience"), row=2, col=2)

# Salary distribution by cluster.
fig.add_trace(
    go.Scatter(x=df_modeling['cluster'], y=df_modeling['SALARY'],
               mode='markers', opacity=0.6, name="Jobs"),
    row=2, col=3
)
fig.update_layout(
    height=650, showlegend=False, template="plotly_white",
    title={'text': "Skills-Based KMeans Clustering Results",
           'y': 0.98, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    margin=dict(t=80)
)
fig.show()
The clustering analysis grouped jobs by their skill requirements and characteristics, identifying six distinct clusters with different salary levels, remote work availability, and skill profiles.
Key Findings:
Business Analytics dominates: 10,831 BA roles vs. 3,226 ML and 2,877 DS
BA-focused growth: Cluster 3 ($109K) — strong BA demand with DS hybrid edge
Specialist track: Cluster 4 ($140K) — pure ML, fewer jobs but high pay
Hybrid advantage: Cluster 0 ($140K) and Cluster 5 ($118K, 56% remote) — multi-skill roles with flexibility
6 Supervised Learning:
6.1 Multiple Regression
The second approach uses supervised learning to predict salary based on skills and experience. Two regression models are trained: Multiple Linear Regression and Random Forest. This analysis identifies which skills and factors most strongly influence compensation.
Code
# Regression features.
# Focus on skills (not role labels) to understand how skills directly affect salary
regression_features = skill_feature_cols + ['experience_years', 'is_remote']

# Preparing regression data using salary as the target variable
X_reg = df_modeling[regression_features].fillna(0)
y_reg = df_modeling['SALARY']
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")

# Scale features
scaler_reg = StandardScaler()
X_train_scaled = scaler_reg.fit_transform(X_train)
X_test_scaled = scaler_reg.transform(X_test)

# Multiple Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Random Forest Regression
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train_scaled, y_train)
print("Skills based regression models training completed")

# Regression statistics for Multiple linear regression
y_train_pred = lr.predict(X_train_scaled)

# Residuals and RSS
residuals = y_train - y_train_pred
rss = np.sum(residuals**2)
n = len(y_train)
k = len(regression_features)

# Model statistics for Multiple linear regression
mse_lr = rss / n
rmse_train_lr = np.sqrt(mse_lr)
r2_train_lr = r2_score(y_train, y_train_pred)
adj_r2_lr = 1 - (1 - r2_train_lr) * (n - 1) / (n - k - 1)

# AIC and BIC for Multiple linear regression
aic_lr = n * np.log(rss / n) + 2 * k
bic_lr = n * np.log(rss / n) + k * np.log(n)
log_likelihood_lr = -aic_lr / 2 + k

# Random Forest statistics
y_train_pred_rf = rf_reg.predict(X_train_scaled)
r2_train_rf = r2_score(y_train, y_train_pred_rf)
rmse_train_rf = np.sqrt(mean_squared_error(y_train, y_train_pred_rf))
residuals_rf = y_train - y_train_pred_rf
rss_rf = np.sum(residuals_rf**2)

# Test set performance
y_test_pred_lr = lr.predict(X_test_scaled)
y_test_pred_rf = rf_reg.predict(X_test_scaled)
r2_test_lr = r2_score(y_test, y_test_pred_lr)
r2_test_rf = r2_score(y_test, y_test_pred_rf)
rmse_test_lr = np.sqrt(mean_squared_error(y_test, y_test_pred_lr))
rmse_test_rf = np.sqrt(mean_squared_error(y_test, y_test_pred_rf))

# Regression statistics table
print("\n=== REGRESSION MODEL STATISTICS ===\n")
regression_stats = pd.DataFrame({
    'Statistic': [
        'Intercept', 'Number of Features', 'Number of Observations (Train)',
        'R-squared (Training)', 'R-squared (Test)', 'Adjusted R-squared',
        'RMSE (Training)', 'RMSE (Test)', 'RSS (Residual Sum of Squares)',
        'MSE (Mean Squared Error)', 'AIC', 'BIC', 'Log-Likelihood'
    ],
    'Multiple linear regression': [
        f"{lr.intercept_:.4f}", f"{k}", f"{n:,}",
        f"{r2_train_lr:.4f}", f"{r2_test_lr:.4f}", f"{adj_r2_lr:.4f}",
        f"${rmse_train_lr:,.2f}", f"${rmse_test_lr:,.2f}", f"{rss:,.2f}",
        f"{mse_lr:,.2f}", f"{aic_lr:.2f}", f"{bic_lr:.2f}", f"{log_likelihood_lr:.2f}"
    ],
    'Random Forest': [
        'N/A', f"{k}", f"{n:,}",
        f"{r2_train_rf:.4f}", f"{r2_test_rf:.4f}", 'N/A',
        f"${rmse_train_rf:,.2f}", f"${rmse_test_rf:,.2f}", f"{rss_rf:,.2f}",
        f"{rmse_train_rf**2:,.2f}", 'N/A*', 'N/A*', 'N/A*'
    ]
})
print(regression_stats.to_string(index=False))
print("\n* AIC/BIC/Log-Likelihood are only applicable to parametric linear models")
print("\nNote: R-squared (Test) shows model performance on unseen data")

print("\n=== FEATURE COEFFICIENTS / IMPORTANCE COMPARISON ===\n")

# Sanity check for alignment
assert len(regression_features) == len(lr.coef_) == len(rf_reg.feature_importances_), \
    "Mismatch between features, LR coefficients, and RF importances!"

# Combine features, coefficients, and importances
coef_comparison = pd.DataFrame({
    'Feature': regression_features,
    'LR_Coefficient': lr.coef_,
    'RF_Importance': rf_reg.feature_importances_
})

# Filter out zero coefficients
coef_comparison = coef_comparison[coef_comparison['LR_Coefficient'] != 0.0]

# Adding impact type (positive or negative only)
coef_comparison['Impact'] = coef_comparison['LR_Coefficient'].apply(
    lambda x: 'Positive' if x > 0 else 'Negative')
coef_comparison['LR_Coefficient'] = coef_comparison['LR_Coefficient'].round(4)
coef_comparison['RF_Importance'] = coef_comparison['RF_Importance'].round(4)

# Positive coefficients (highest impact)
top_positive = (
    coef_comparison[coef_comparison['Impact'] == 'Positive']
    .sort_values(by='LR_Coefficient', ascending=False)
    .head(15)
)
print("Top 15 Features by Multiple linear regression Coefficient (Positive Impact):")
print(top_positive[['Feature', 'LR_Coefficient', 'RF_Importance']].to_string(index=False))

# Negative coefficients (lowest impact)
top_negative = (
    coef_comparison[coef_comparison['Impact'] == 'Negative']
    .sort_values(by='LR_Coefficient', ascending=True)
    .head(15)
)
print("\nTop 15 Features by Multiple linear regression Coefficient (Negative Impact):")
print(top_negative[['Feature', 'LR_Coefficient', 'RF_Importance']].to_string(index=False))

print(f"\nMultiple linear regression Intercept: {lr.intercept_:.4f}")
print("\nNote:")
print("- LR Coefficients show the direction and strength of linear relationships with the target.")
print("- Positive coefficients increase predicted salary; negative coefficients decrease it.")
print("- RF Importance reflects how much each feature contributes to model accuracy (non-linear).")
Training set size: 24,646
Test set size: 6,162
Skills based regression models training completed
=== REGRESSION MODEL STATISTICS ===
Statistic Multiple linear regression Random Forest
Intercept 117744.2020 N/A
Number of Features 27 27
Number of Observations (Train) 24,646 24,646
R-squared (Training) 0.2678 0.5237
R-squared (Test) 0.2780 0.4672
Adjusted R-squared 0.2670 N/A
RMSE (Training) $38,730.71 $31,238.34
RMSE (Test) $37,899.01 $32,558.54
RSS (Residual Sum of Squares) 36,970,667,997,179.88 24,050,395,305,765.03
MSE (Mean Squared Error) 1,500,067,678.21 975,833,616.24
AIC 520793.81 N/A*
BIC 521012.85 N/A*
Log-Likelihood -260369.91 N/A*
* AIC/BIC/Log-Likelihood are only applicable to parametric linear models
Note: R-squared (Test) shows model performance on unseen data
=== FEATURE COEFFICIENTS / IMPORTANCE COMPARISON ===
Top 15 Features by Multiple linear regression Coefficient (Positive Impact):
Feature LR_Coefficient RF_Importance
experience_years 17850.7023 0.4932
has_python_programming_language 5665.9793 0.0300
has_amazon_web_services 3354.0141 0.0361
has_big_data 3255.5115 0.0257
has_artificial_intelligence 1905.1339 0.0239
has_microsoft_azure 1587.5365 0.0192
has_machine_learning 1501.0951 0.0265
has_data_science 1297.6624 0.0261
has_pandas_python_package 934.5885 0.0042
has_scikit_learn_python_package 733.9453 0.0003
has_deep_learning 404.1220 0.0023
has_google_cloud_platform_gcp 213.1297 0.0084
is_remote 201.2772 0.0728
has_computer_vision 130.9562 0.0006
has_apache_hadoop 70.5746 0.0075
Top 15 Features by Multiple linear regression Coefficient (Negative Impact):
Feature LR_Coefficient RF_Importance
has_data_analysis -7222.0464 0.0426
has_r_programming_language -2811.3382 0.0193
has_power_bi -1737.9884 0.0232
has_statistics -1298.4926 0.0302
has_sql_programming_language -1087.0998 0.0350
has_pytorch_machine_learning_library -801.3560 0.0005
has_tableau_business_intelligence_software -789.6728 0.0372
has_numpy_python_package -635.2314 0.0004
has_data_visualization -524.2779 0.0249
has_natural_language_processing_nlp -260.2604 0.0019
has_tensorflow -256.5688 0.0005
has_apache_spark -248.2513 0.0073
Multiple linear regression Intercept: 117744.2020
Note:
- LR Coefficients show the direction and strength of linear relationships with the target.
- Positive coefficients increase predicted salary; negative coefficients decrease it.
- RF Importance reflects how much each feature contributes to model accuracy (non-linear).
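For reference, the training-set statistics above follow the standard definitions, with n observations, k features, and RSS the residual sum of squares; the AIC/BIC expressions drop the constant terms of the Gaussian likelihood, matching the code, and the reported log-likelihood is recovered as ln L = k − AIC/2.

\[
\mathrm{RMSE} = \sqrt{\mathrm{RSS}/n}, \qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
\]
\[
\mathrm{AIC} = n \ln(\mathrm{RSS}/n) + 2k, \qquad
\mathrm{BIC} = n \ln(\mathrm{RSS}/n) + k \ln n
\]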
Both models were trained on 80% of the data and evaluated on the remaining 20% test set. The Random Forest model can capture non-linear relationships and interactions between skills, while Multiple Linear Regression provides a baseline for comparison.
Code
# Test metrics already calculated above
r2_lr = r2_test_lr
r2_rf = r2_test_rf
rmse_lr = rmse_test_lr
rmse_rf = rmse_test_rf
y_pred_rf = y_test_pred_rf

print("Skills-based Regression Model Performance (Test Set):")
print(f"Multiple Linear Regression - RMSE: ${rmse_lr:,.2f}, R²: {r2_lr:.4f}")
print(f"Random Forest - RMSE: ${rmse_rf:,.2f}, R²: {r2_rf:.4f}")

# Feature importance for Random Forest
# Features that actually exist in the model
actual_feature_names = [col for col in regression_features if col in X_train.columns]
importances = rf_reg.feature_importances_

# Visualize feature importance
fig = px.bar(x=actual_feature_names, y=importances,
             title="Skills Impact on Salary (Random Forest Feature Importance)",
             labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()

# Top skills by salary impact
skill_importance = list(zip(actual_feature_names, importances))
skill_importance.sort(key=lambda x: x[1], reverse=True)
print("\nTop skills by salary impact:")
for skill, importance in skill_importance[:10]:
    print(f"{skill}: {importance:.4f}")
Skills-based Regression Model Performance (Test Set):
Multiple Linear Regression - RMSE: $37,899.01, R²: 0.2780
Random Forest - RMSE: $32,558.54, R²: 0.4672
Prediction models were built to understand how skills influence salary. The Random Forest model achieved an R² of 0.47, compared with 0.28 for Multiple Linear Regression, showing that skill-salary relationships are complex and partly non-linear.
Model Performance:
Random Forest: R² = 0.47 (explains 47% of salary variation), RMSE = $32,559
Multiple Linear Regression: R² = 0.28
Insight: Skills alone do not fully explain salary — other factors also matter.
Key Salary Drivers (Feature Importance):
Experience (0.49): Largest factor, accounting for nearly half of the model's total feature importance
Remote work (0.07): Flexibility influences pay differences
Data Analysis (0.04): Core analytical capability
Tableau (0.04): Visualization and BI tool
AWS (0.04): Cloud computing platform
SQL (0.04): Database querying and manipulation
Statistics (0.03): Analytical foundation
Python (0.03): Programming language
Career Implications:
Experience is critical — the strongest driver of salary.
Remote work adds value — flexibility can boost compensation.
Skill combinations matter — technical, analytical, and cloud skills together shape salary outcomes.
Summary: Salary is not determined by skills alone. Experience and work flexibility are key, while technical skills provide additional differentiation.
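These conclusions rest on a single 80/20 split. A quick cross-validation check, sketched below but not part of the original analysis, would make the R² estimates more robust; scaling is skipped here since it changes neither tree-based predictions nor the OLS fit quality.

Code

from sklearn.model_selection import cross_val_score

# Sketch: 5-fold cross-validated R² for both regression models.
cv_r2_lr = cross_val_score(LinearRegression(), X_reg, y_reg, cv=5, scoring='r2')
cv_r2_rf = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=42),
                           X_reg, y_reg, cv=5, scoring='r2')
print(f"Linear Regression CV R²: {cv_r2_lr.mean():.4f} ± {cv_r2_lr.std():.4f}")
print(f"Random Forest CV R²: {cv_r2_rf.mean():.4f} ± {cv_r2_rf.std():.4f}")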
6.2 Classification to Identify BA/ML/DS Roles
Although the project required only one supervised learning model, this analysis also explores classification to distinguish ML/Data Science roles from Business Analytics and other positions. A Random Forest Classifier is trained to predict whether a job is an ML/DS role based on its skill requirements. This analysis reveals which skills are the strongest “signature” indicators that distinguish ML/DS positions from BA roles.
Code
# Prepare features for classification.
classification_features = skill_feature_cols + ['experience_years', 'is_remote']

# Classification data
X_clf = df_modeling[classification_features].fillna(0)

# Target: ML/DS roles (computed from is_ml_role OR is_ds_role)
y_clf = ((df_modeling['is_ml_role'] == 1) | (df_modeling['is_ds_role'] == 1)).astype(int)

# Train/test split for classification
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42)

# Scale features
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

# Random Forest Classification
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_clf_scaled, y_train_clf)
print("Skills-based classification model trained successfully!")
Skills-based classification model trained successfully!
The classifier learns patterns that distinguish ML/DS roles from BA and other positions based on their skill profiles. The model is now evaluated to see how accurately it can identify these specialized ML/DS roles versus the more common BA positions.
Code
# Random Forest predictions
y_pred_rf_clf = rf_clf.predict(X_test_clf_scaled)
accuracy_rf = accuracy_score(y_test_clf, y_pred_rf_clf)
f1_rf = f1_score(y_test_clf, y_pred_rf_clf)
print("Skills based Classification Model Performance:")
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, F1 Score: {f1_rf:.4f}")

# Confusion Matrix for Random Forest
cm = confusion_matrix(y_test_clf, y_pred_rf_clf)

# Visualize confusion matrix
fig = px.imshow(cm, text_auto=True, aspect="auto",
                title="Confusion Matrix - ML/DS Role Classification",
                labels=dict(x="Predicted", y="Actual"),
                color_continuous_scale="Blues")
fig.update_layout(template="plotly_white")
fig.update_xaxes(tickvals=[0, 1], ticktext=['Not ML/DS', 'ML/DS'])
fig.update_yaxes(tickvals=[0, 1], ticktext=['Not ML/DS', 'ML/DS'])
fig.show()

print("Classification Report:")
print(classification_report(y_test_clf, y_pred_rf_clf))

# Features that actually exist in the classification model
clf_actual_feature_names = [col for col in classification_features if col in X_train_clf.columns]
clf_importances = rf_clf.feature_importances_

# Visualize classification feature importance
fig = px.bar(x=clf_actual_feature_names, y=clf_importances,
             title="Skills Impact on ML/Data Science Role Classification",
             labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()
Skills based Classification Model Performance:
Random Forest - Accuracy: 0.9995, F1 Score: 0.9986
A Random Forest classifier was used to predict whether a job is an ML/Data Science role based on its skill requirements. The model achieved very strong performance in separating ML/DS roles from Business Analytics and other positions.
Model Performance:
Accuracy: 99.95% — nearly all postings correctly classified as ML/DS or not
Insight: ML/DS roles have distinct skill patterns compared to BA and general analyst jobs
Conclusion: Skill-based criteria effectively distinguish ML/DS roles from BA positions
Key Predictive Skills (Feature Importance)
Programming: Python, R
ML Frameworks: TensorFlow, PyTorch
Statistical Modeling: Core differentiator for ML/DS
BA-Oriented Skills: SQL, Tableau, Power BI, Data Analysis (more common in BA roles)
Career Implications
Distinct skill sets: ML/DS roles require clearly different capabilities than BA roles
ML/DS focus: Programming, modeling, and ML frameworks are the strongest signals
BA focus: SQL, visualization, and reporting tools dominate BA roles
Career development: Building expertise in high-importance ML/DS features directly improves readiness for ML/DS positions
Summary: The Random Forest classifier confirms that ML/DS roles are defined by specialized technical skills, while BA roles emphasize analysis and visualization tools. This distinction provides a clear roadmap for professionals aiming to transition into ML/DS careers.
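As a usage illustration, the sketch below scores one hypothetical posting; the skill profile is invented, not drawn from the dataset.

Code

# Hypothetical posting: Python + TensorFlow + Machine Learning,
# 3 years of experience, remote; all other skill indicators stay 0.
new_job = pd.DataFrame([{col: 0 for col in classification_features}])
new_job['has_python_programming_language'] = 1
new_job['has_tensorflow'] = 1
new_job['has_machine_learning'] = 1
new_job['experience_years'] = 3
new_job['is_remote'] = 1

# Apply the scaler fitted on the training data, then predict.
new_job_scaled = scaler_clf.transform(new_job[classification_features])
prob = rf_clf.predict_proba(new_job_scaled)[0, 1]
print(f"Predicted probability of being an ML/DS role: {prob:.2f}")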
7 Model Results Visualization
This section provides a consolidated view of the regression modeling approaches. The comparison shows how different models perform on salary prediction and highlights the most impactful skills across different analyses.
Code
# Model performance
model_summary = pd.DataFrame({
    'Model': ['Multiple Linear Regression', 'Random Forest (Regression)'],
    'R² (Test)': [r2_lr, r2_rf],
    'RMSE (Test)': [rmse_lr, rmse_rf]
})
print(model_summary)

# Visualization of model results
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('R² Comparison (Test Set)', 'RMSE Comparison (Test Set)',
                    'Skills vs Salary Impact', 'Predicted vs Actual Salary'),
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "scatter"}]]
)

# Row 1, Col 1: R² comparison
models = ['Multiple linear regression', 'Random Forest']
r2_values = [r2_lr, r2_rf]
fig.add_trace(go.Bar(x=models, y=r2_values, name="R² Score",
                     marker_color=['steelblue', 'darkgreen']), row=1, col=1)

# Row 1, Col 2: RMSE comparison
rmse_values = [rmse_lr, rmse_rf]
fig.add_trace(go.Bar(x=models, y=rmse_values, name="RMSE",
                     marker_color=['coral', 'orange']), row=1, col=2)

# Row 2, Col 1: Skills vs salary impact (top 10)
top_skills_salary = skill_importance[:10]
fig.add_trace(go.Bar(
    x=[s[1] for s in top_skills_salary],
    y=[s[0] for s in top_skills_salary],
    orientation='h', name="Feature Importance", marker_color='purple'
), row=2, col=1)

# Row 2, Col 2: Predicted vs Actual for Random Forest
sample_size = min(500, len(y_test))
sample_indices = np.random.choice(len(y_test), sample_size, replace=False)
fig.add_trace(go.Scatter(
    x=y_test.iloc[sample_indices],
    y=y_pred_rf[sample_indices],
    mode='markers', name='RF Predictions',
    marker=dict(color='darkgreen', size=5, opacity=0.6)
), row=2, col=2)

# Prediction line
min_val = min(y_test.min(), y_pred_rf.min())
max_val = max(y_test.max(), y_pred_rf.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val], y=[min_val, max_val],
    mode='lines', name='Perfect Prediction',
    line=dict(color='red', dash='dash')
), row=2, col=2)

# Axis labels
fig.update_xaxes(title_text="Model", row=1, col=1)
fig.update_yaxes(title_text="R² Score", row=1, col=1)
fig.update_xaxes(title_text="Model", row=1, col=2)
fig.update_yaxes(title_text="RMSE ($)", row=1, col=2)
fig.update_xaxes(title_text="Importance", row=2, col=1)
fig.update_yaxes(title_text="Feature", row=2, col=1)
fig.update_xaxes(title_text="Actual Salary ($)", row=2, col=2)
fig.update_yaxes(title_text="Predicted Salary ($)", row=2, col=2)

fig.update_layout(
    height=800, showlegend=False, template="plotly_white",
    title={'text': "Regression Model Comparison - BA/ML/DS Salary Prediction",
           'y': 0.98, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    margin=dict(t=80)
)
fig.show()
Model R² (Test) RMSE (Test)
0 Multiple Linear Regression 0.278032 37899.005358
1 Random Forest (Regression) 0.467166 32558.537199
8 Key Takeaways and Recommendations
8.1 Summary of Findings
Our analysis of Business Analytics, Data Science, and Machine Learning job postings reveals several important patterns:
Role Distribution: Business Analytics dominates (35% of jobs), while ML and DS remain smaller but specialized segments.
Job Segmentation: Six distinct clusters reveal clear differences in pay, experience, and hybrid skill mixes.
Salary Drivers: Experience is the strongest factor (0.49 feature importance), with remote work and technical skills adding incremental impact.
Role Differentiation: ML/DS roles are highly distinct, with classification accuracy of 99.95% separating them from BA roles.
8.2 Recommendations for Job Seekers
For Career Advancement:
Gain experience - it’s the single biggest salary driver (0.49 feature importance)
Remote work flexibility - BA/ML/DS roles pay well even when remote, showing that onsite presence is not necessary for competitive salaries.